Fine-tuning an LLM on your texts: a simulation of you

This is the conclusion of my guide to fine-tuning an LLM on your text message history. It’s time for text generation. Over the holidays, I did this myself with my 240,000-message text history, with entertaining results.

The journey so far:
In part 1, we set up the environment and downloaded text / WhatsApp history
Then, in part 2, we organized and filtered down the texts
And in part 3, we curated, encrypted and uploaded the datasets
In the last installment, part 4, we ran QLoRA fine-tuning on a Llama 2 base model.

Prepare for inference

I used a fresh Jupyter notebook, so I could experiment with Text Generation in parallel with training. Start with the constants, matching those used during training:

BASE_MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
PROJECT_NAME = 'messages'
RUN_NAME = 'v1'
MODEL_NAME = f"your-hf-username/{PROJECT_NAME}-{RUN_NAME}"
MAX_LENGTH = 200
ME = "Edward" # your name here

Now, the necessary installs and imports. If you’re not familiar with the libraries we’re using, check them out in the HF tutorials or API docs.

# installs
!pip install -q torch peft bitsandbytes transformers trl accelerate sentencepiece

# imports
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, PeftConfig
from IPython.display import clear_output

Sign in to Hugging Face as before:

from huggingface_hub import notebook_login
notebook_login()

Load your fine-tuned model

Time to download your lovingly-crafted fine-tuned Llama 2 from Hugging Face:

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME,
    quantization_config=quant_config,
    device_map="auto",
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

# load the LoRA adapter weights on top of the quantized base model
model = PeftModel.from_pretrained(base_model, MODEL_NAME)
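
Before building the generation loop, it’s worth a quick sanity check that the adapter loaded and the model generates something. This is just an optional smoke test (not part of the pipeline), using a prompt loosely in the training format:

# optional smoke test: confirm the fine-tuned model produces output
test_prompt = f"[INST]Write a chat between {ME} and a friend[/INST]\n### {ME}:"
test_inputs = tokenizer(test_prompt, return_tensors="pt").to('cuda')
test_outputs = model.generate(**test_inputs, max_new_tokens=30)
print(tokenizer.decode(test_outputs[0], skip_special_tokens=True))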

Text Generation

You’ll remember from part 1 that Hugging Face provides super-convenient pipelines for generation. I quickly discovered, though, that I needed finer-grained control over the generation process. This work, more than anything else, made the biggest difference to the quality of the results.

I started with a helper method:

SUPPRESS_TOKENS = [26308, 243, 162,   155,   149, 160, 47, 18610]
BAD_WORDS = [[26308], [243], [162], [155], [149], [160], [47], [18610], [229,159,171]]


def generate_next(text, min_tokens, max_tokens):
  inputs, outputs = [], []
  attempt = ""
  final_tokens = []
  try:
    inputs = tokenizer(text, return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs,
                             max_new_tokens=max_tokens,
                             min_new_tokens=min_tokens,
                             return_dict_in_generate=True,
                             output_scores=False,
                             no_repeat_ngram_size=6,
                             suppress_tokens=SUPPRESS_TOKENS,
                             bad_words_ids=BAD_WORDS)
    sequence = outputs['sequences'][0]
    attempt = tokenizer.decode(sequence, skip_special_tokens=True)
    final_tokens = sequence[-10:]  # keep the last few token IDs for debugging
  finally:
    del inputs
    del outputs
    torch.cuda.empty_cache()
  return attempt, final_tokens

I’m using SUPPRESS_TOKENS and BAD_WORDS to prevent the model from generating certain tokens, by zeroing out their probabilities before sampling. Token 18610 is “***”, the token we substituted for an image. I found it was better to suppress these; otherwise the model sometimes got trapped in a loop, furiously exchanging images with itself.
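
If you’re wondering how to find the token IDs behind a troublesome string in the first place, a quick way (just a sketch, using the tokenizer loaded above) is to encode it and inspect the pieces:

# look up the token IDs behind a troublesome string, e.g. the "***" image placeholder
ids = tokenizer.encode("***", add_special_tokens=False)
print(ids)  # candidate IDs to add to SUPPRESS_TOKENS / BAD_WORDS
print([tokenizer.decode([i]) for i in ids])  # confirm what each ID decodes back to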

You’ll also notice that I included some cleanup after generation, to stop the GPU from running out of memory.

One final, final piece of data massaging

I realize this next chunk of code is quite a lot to take in, but I promise it will be worth the investment. I often found that generating conversations could get messy. For example, the model sometimes started to hallucinate, and the format went awry and needed to be repaired. Shades of this iconic Arnie moment.

So I decided to work super carefully, generating a few tokens at a time, and reconstructing the conversation at each step. I created these two classes to help: Message and Conversation.

Along with repairing the conversation, this code truncates to the last 10 messages (see NLI_MAX_MESSAGES), which seems sufficient to keep the context of the conversation. Too much more than this, and the LLM starts to lose the plot… try bumping up that number to descend into your very own TWO WEEKS meltdown.

class Message:
  def __init__(self, sender=None, message=None, text=None):
    self.is_complete = True
    if message is not None:
      self.sender = ME if sender is None else sender
      self.message = message
    else:
      # repair a ';' that should have been a ':' between sender and message
      if ':' not in text and ';' in text:
        text = text.replace(';', ':')
      if ':' not in text:
        # sender only, no message yet - mark as incomplete
        self.sender = text
        self.message = ''
        self.is_complete = False
      else:
        beginning, ending = text.split(':', 1)  # split on the first colon only
        self.sender = beginning.replace('###', '').strip()
        self.message = ending.strip()

  def __repr__(self):
    if self.is_complete:
      return f'### {self.sender}: {self.message}'
    else:
      return f'### {self.sender}'


class Conversation:

  NLI_MAX_MESSAGES = 10

  def __init__(self, who):
    self.who = who
    self.messages = []
    self.nli_message_count = 0
    self.current_sender = ME

  def prefix(self):
    result = f"<<SYS>>Write a realistic text message chat. Avoid repetition.<</SYS>>\n"
    result += f"[INST]Write a chat between {ME} and {self.who}[/INST]\n"
    return result

  def next_sender(self):
    self.current_sender = self.who if self.current_sender == ME else ME

  def add(self, message_contents):
    self.add_message(Message(message=message_contents, sender=self.current_sender))

  def add_message(self, message):
    self.messages.append(message)

  def add_prompt(self):
    self.add('')

  def nli(self):
    # build the prompt: instruction prefix plus the most recent messages
    result = self.prefix()
    nlis = [repr(message) for message in self.messages[-Conversation.NLI_MAX_MESSAGES:]]
    self.nli_message_count = len(nlis)
    result += ' '.join(nlis)
    return result

  def __repr__(self):
    result = ""
    for message in self.messages:
      result += message.__repr__() + '\n'
    return result

  def process(self, language):
    # repair common glitches in the generated separators, then split into messages
    language = language.replace('?:', ':').replace('::', ':')
    incoming = language.replace(' ###', '###').split('### ')[1:]
    self.messages = self.messages[:-1]  # drop the prompt message; it will be replaced
    new_messages = incoming[self.nli_message_count - 1:]
    for index, new_message in enumerate(new_messages):
      message = Message(text=new_message)
      if message.sender != self.current_sender and index != 0:
        return True  # the speaker changed, so the current turn is complete
      else:
        self.add_message(message)
    return False  # keep generating
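
To get a feel for exactly what the model sees, here’s a purely illustrative example (not part of the chat loop below) that builds a short Conversation and prints the nli() prompt:

# illustration only: build a short conversation and inspect the prompt sent to the model
convo = Conversation("James")
convo.add("Hey James")               # sent as ME (Edward)
convo.next_sender()
convo.add("Hey Ed. How are you?")    # sent as James
convo.next_sender()
convo.add_prompt()                   # empty message invites the model to speak as Edward
print(convo.nli())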

Game time

You’ve patiently curated your dataset, you’ve fine-tuned your Llama, you’ve spent $100. You are ready to run a simulation of yourself. Run this code, and hit Enter every time you want the LLM to generate a response. The LLM can act as you, or act as the other person in the chat, or as… both.

print('Who is the conversation with?')
who = input()

conversation = Conversation(who)
while True:
  print(f'{conversation.current_sender}: ')
  reply = input()
  if reply == 'stop':
    break
  elif reply != '':
    conversation.add(reply)  # you typed this turn yourself
  else:
    conversation.add_prompt()  # empty input: let the model take this turn
    ready = False
    while not ready:
      # generate a few tokens at a time, repairing the conversation after each step
      language, final_tokens = generate_next(conversation.nli(), 3, 8)
      ready = conversation.process(language)
      clear_output(wait=True)
      print(conversation)
      # print(final_tokens)
  conversation.next_sender()

I’ve commented out the print(final_tokens) line, which prints the last few tokens of the conversation as raw token IDs. If your model starts to produce spurious or repetitive characters, words or phrases, you can use this to identify the offending tokens and add them to the suppression lists.
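
To turn those IDs into something readable, you can decode them one at a time. Here’s a small sketch, using the tokenizer already loaded, that you could drop in place of the print(final_tokens) line:

# decode each recent token individually to spot spurious characters worth suppressing
for token_id in final_tokens:
  print(int(token_id), repr(tokenizer.decode([int(token_id)])))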

Your reward

You can now enjoy yourself trying out chats. The subject matter can sometimes be on the bland side until the LLM is provoked with spicy questions, as you saw in my original post. Try asking the simulated version of you to reveal your deepest secrets!

For me, one of the spookier aspects is how well the model adopts the tone and characteristics of my friends, even on everyday topics. Here’s my friend “James”, who’s good at making time for people and sticking to plans.

### Edward: Hey James
### James: Hey Ed. How are you?
### Edward: I'm great thanks. What are you up to
### James: Just in town for a few days. I'm going to be here until the 27th.
### James: I'm actually in the city now. Just at the gym
### Edward: Free to hang out while you're here?
### James: Sure. What are you doing?
### Edward: Just working on a blog post
### James: What time are you done?
### Edward: I'll be finished at 6pm - it would be great to see you tonight if you're free
### James: 6pm is perfect.

And here’s “John”, who has a more hurried style, over-uses commas, and is quite impossible to get to commit to anything:

### Edward: Hey John
### John: Hey, how are u?
### Edward: I'm great thanks. What are you up to
### John: I'm at work, waiting for a client, what's new?
### Edward: Not much, I'm working on a blog post
### John: Oh, that's cool, what's the topic?
### Edward: it's about generating text chats
### John: I'm interested, I've been working on my own web site, I have a lot of pics to post, but I need to figure out how to do it
### Edward: you know I'm good at things like that, I'd love to help
### John: I know, u have a lot of knowledge in this field, I'll ask u for help, I don't want to make it too hard to navigate
### John: I'll let u know when I'm ready
### Edward: I'd love that.. and are you free to hang out some time?
### John: I'm not sure, I'm trying to get back into work, I'll let u no when I can
### Edward: How about this weekend? It would be great to catch up
### John: I'll let u now, I'm trying not to be so flaky

It might not look great at first. But please don’t be discouraged. Start with some debugging; triple-check that the training datasets are structured right. Immerse yourself in the Weights & Biases charts and confirm that things are moving in the right direction. Change hyper-parameters, one at a time to start with. And most importantly: dig into the Text Generation process and look for improvements, such as suppressing more tokens.

You can try more options in the call to model.generate(), reading about them in the docs. I experimented with beam search, but I got the best results with the default settings above. Lower values of no_repeat_ngram_size sometimes caused longer conversations to go off the rails, but it’s worth trying for yourself.

num_beams=8,
num_return_sequences=1,
no_repeat_ngram_size=6,
repetition_penalty=1.0, # default
temperature = 1.0, # default
do_sample=True, # default
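
If you want to experiment with these, they slot straight into the model.generate() call inside generate_next. As a sketch, here’s that call with beam search switched on (I didn’t keep this, but it’s a reasonable starting point for your own tuning):

    # inside generate_next: the same call, with beam search options added
    outputs = model.generate(**inputs,
                             max_new_tokens=max_tokens,
                             min_new_tokens=min_tokens,
                             num_beams=8,
                             num_return_sequences=1,
                             no_repeat_ngram_size=6,
                             return_dict_in_generate=True,
                             suppress_tokens=SUPPRESS_TOKENS,
                             bad_words_ids=BAD_WORDS)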

And finally, when you’re seeing decent results, switch to the 13B variant of Llama 2. Fine-tuning will take longer, of course, but I found the quality of results was substantially improved.
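
Switching is just a matter of pointing the base model constant at the larger checkpoint (and re-running the fine-tune from part 4 before coming back here):

# swap in the 13B chat checkpoint - fine-tuning and inference both take longer
BASE_MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"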

What’s next

I have so many ideas for more things to try! I’ve been exploring models at the top of the Hugging Face leaderboard, but so far the Llama 2 version is the best. I’m considering training another model on my Slack history; I’d be intrigued to see a conversation between Work me and Home me…

I hope you’ve enjoyed this journey and learned a thing or two along the way. If you’ve found bugs or improvements in my code, or better techniques or hyper-parameters, please get in touch. I’d love to try out your ideas and I’ll post any updates here.

I only have one request. If you reach out by text — be sure not to ask for my deepest secrets 😂
